Morpheme Segmentation from Distributional Information

Authors

  • Sara Finley
  • Elissa L. Newport

Abstract

Morphology is the study of how meaningful components of form are combined to make complex words. Understanding how such complex words can be ‘broken apart’ into their morphological constituents is the problem of morpheme segmentation. While words that have similar meanings tend to share similar forms (e.g., run and running), many morphemes do not have transparently shared meanings. For example, canning and running are only abstractly related in meaning through the progressive morpheme -ing. Further, sharing phonological material is not sufficient for morphological relatedness. Many words contain overlapping phonetic material without being morphologically related (e.g., words of the same cohort, such as canning and canopy).
The question, then, is how learners find morphologically related words in the linguistic input, even when these words may not share common meanings (or before learners know what the words mean). The hypothesis that we address in this paper is that learners use the distributional information from the forms they hear in order to infer morphological regularities. For example, many words end in -ing, while far fewer begin with can. Learners may be able to extract this regularity across many words in the language to identify the patterns that characterize the structure of words. In the present study, we expose adult learners to words that fit a stem-affix pattern (stem+suffix). We then test whether learners are able to extract the regularities of the stem-affix pattern through generalization to novel stem-affix combinations. If learners can do this without semantic cues, it would suggest that they can form morphological parses from distributional cues.
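
The distributional cue described above (many words ending in -ing, far fewer beginning with can) can be illustrated with a small counting sketch. This is not the procedure used in the study, which tests human learners on artificial stem+suffix words; it only shows what kind of regularity is available in the forms themselves. The function name `edge_ngram_counts` and the toy lexicon are hypothetical, introduced here for illustration, and counting word types rather than tokens is an assumption.

```python
from collections import Counter


def edge_ngram_counts(words, n):
    """Count how many distinct words begin and end with each n-character string."""
    starts, ends = Counter(), Counter()
    for word in set(words):      # assumption: count word types, not tokens
        if len(word) > n:        # require a non-empty remainder (a candidate stem)
            starts[word[:n]] += 1
            ends[word[-n:]] += 1
    return starts, ends


# Hypothetical toy lexicon, chosen only for illustration.
TOY_LEXICON = [
    "running", "canning", "walking", "singing", "jumping",
    "canopy", "candle", "runner", "walker", "table",
]

starts, ends = edge_ngram_counts(TOY_LEXICON, 3)
print("words ending in -ing:", ends["ing"])
print("words beginning with can:", starts["can"])
```

In this toy lexicon the final string ing recurs across more word types than the initial string can, which is the asymmetry a learner could exploit to treat -ing as a candidate suffix. No meanings are consulted at any point.
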
Studies in another area of language learning, focused on word categories and sub-categories, have argued that phonological cues (Brooks, Braine, Catalano, & Brody, 1993; Frigo & McDonald, 1998; Gerken, Wilson, & Lewis, 2005) and semantic cues (Braine et al., 1990) are highly important aids in learning (MacWhinney, Leinbach, Taraban, & McDonald, 1989; Maratsos & Chalkley, 1980). Most of these studies have suggested that learners have great difficulty in learning categories without phonological or semantic cues, and some have argued that learning categories is impossible without these cues to category structure (Gomez & Gerken, 2000). However, recent evidence shows that learners can use distributional information alone to acquire categories, as long as the distributional regularities are rich enough to support the induction of grammatical classes (Reeder, Newport, & Aslin, 2009).
While there is thus evidence that learners can use distributional information to learn categories, previous research on affix learning has focused on learning the meaning of the affix rather than the distribution of forms (Braine et al., 1990; MacWhinney, 1983). For example, Braine et al. (1990) taught children inflectional locative affixes (e.g., to, from, at) in an artificial grammar learning setting. In this case, learning the form of the affix was dependent on the semantic context associated with the form. However, in natural languages the systematic pattern of affixation is not necessarily dependent on the meaning associated with the affix.
There are two reasons why it is important to study how morpheme segmentation can be done without semantics. First, a morpheme is more than just semantics; it involves both form and meaning. One must explain why words that are morphologically related are likely to be phonologically related as well.
Understanding how learners can infer morphological relatedness through form relatedness is important for understanding the structure of the lexicon as well as the restrictions on morphological

Similar resources

Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis

Bilingual lexicon extraction from comparable corpora is usually based on distributional methods when dealing with single word terms (SWT). These methods often treat SWT as single tokens without considering their compositional property. However, many SWT are compositional (composed of roots and affixes) and this information, if taken into account, can be very useful to match translational pairs,...

Linguistic Problems In Multilingual Morphological Decomposition

An algorithm for the morphological decomposition of words into morphemes is presented. The application area is information retrieval, and the purpose is to find morphologically related terms to a given search term. First, the parsing framework is presented, then several linguistic decisions are discussed: morpheme selection and segmentation, morpheme classes, morpheme grammar, allomorph handlin...

Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, m...

Discrimination in lexical decision

In this study we present a novel set of discrimination-based indicators of language processing derived from Naive Discriminative Learning (NDL) theory. We compare the effectiveness of these new measures with classical lexical-distributional measures (in particular, frequency counts and form similarity measures) to predict lexical decision latencies when a complete morphological segmentation of ma...

Cross-lingual Word Segmentation and Morpheme Segmentation as Sequence Labelling

This paper presents our segmentation system developed for the MLP 2017 shared tasks on cross-lingual word segmentation and morpheme segmentation. We model both word and morpheme segmentation as character-level sequence labelling tasks. The prevalent bidirectional recurrent neural network with conditional random fields as the output interface is adapted as the baseline system, which is further i...

Publication date: 2010